Abstract

Background: Biological networks have a growing importance for the interpretation of high-throughput "omics" 
data. Integrative network analysis makes use of statistical and combinatorial methods to extract smaller subnetwork 
modules, and performs enrichment analysis to annotate the modules with ontology terms or other available 
knowledge. This process results in an annotated module, which retains the original network structure and includes 
enrichment information as a set system. A major bottleneck is a lack of tools that allow exploring both the network 
structure of extracted modules and their annotations. 

Results: This paper presents a visual analysis approach that targets small modules with many set-based annotations 
and displays the annotations as contours on top of a node-link diagram. We introduce an extension of 
self-organizing maps to lay out nodes, links, and contours in a unified way. An implementation of this approach is 
freely available as the Cytoscape app eXamine.

Conclusions: eXamine accurately conveys small, annotated modules consisting of several dozen proteins 
and annotations. We demonstrate that eXamine facilitates the interpretation of integrative network analysis results in 
a guided case study. This study has resulted in a novel biological insight regarding the virally-encoded G protein-coupled 
receptor US28. 

Background 

High-throughput "omics" data provide snapshots of 
cellular states in a specific condition. Computational 
approaches can be used to relate these low-level mea- 
surements with high-level changes in phenotype. Tradi- 
tionally, these approaches were gene-centric and typically 
resulted in ranked lists of differentially expressed genes 
[1-3]. Later, gene-centric approaches were complemented 
by pathway- [4,5] and network-based methods [6,7] 
to provide inter-gene context for mechanistic insights. 
Pathway-based approaches identify overrepresented path- 
ways from databases such as the Kyoto Encyclopedia 
of Genes and Genomes (KEGG) [8]. Network-based 
approaches yield small, de novo subnetwork modules 
that may span several known pathways, and reveal their 
crosstalk [9]. 

Extracted network modules are analyzed in the context 
of established gene annotations to hypothesize about the 
module's role in high-level cell conditions (see Figure 1). 
Genes are often related to very many terms (too many for 
human comprehension), most of which are likely irrele- 
vant to the analysis context. Therefore, overrepresentation 
analysis is performed to rank information items by their 
significance. These items originate from ontologies such 
as the Gene Ontology (GO) [10], which identifies cellular 
functions, processes and components that nodes relate to, 
or from KEGG [8], which relates nodes to pathways. This 
results in an annotated module, which retains the original 
network structure and includes enrichment information 
as a set system. 

Figure 1 Data and analysis pipeline. First, control and experimental samples are analyzed to estimate expression levels. Subsequently, gene 
expression differences (between experiment and control) and their significance are determined. These differences are then mapped to an interaction 
network, from which a module is extracted with overall significantly-differential gene expression. This module is annotated with overrepresented 
cell mechanisms from ontology and pathway databases. Finally, the enriched module undergoes iterative visual analysis via eXamine. 

Existing tools focus on visualizing large networks, and 
have only limited or separate set system support, or no 
support at all. Our proposed visual analysis approach 
displays sets as contours on top of a node-link layout (see 
Figure 2). It treats module edges and annotation sets in a 
unified way, and contributes the following to the analysis 
of annotated modules: 

• Identification of elementary module analysis tasks 
and their composition into a visual analysis process; 

• Extension of the self-organizing maps (SOM) 
algorithm to lay out module interactions and 
annotations in a unified approach; 

• Implementation in the form of the Cytoscape app 
eXamine; 

• Demonstration of eXamine via a guided study of an 
annotated module that is activated by the 
virally-encoded G protein-coupled receptor US28; 

• Discussion on how eXamine facilitates the analysis 
process. 

Figure 2 Visualization of an annotated module. Interacting 
proteins with a selection of three subsets, corresponding to 
overrepresented KEGG pathways. The visualization consists of a 
combination of a node-link diagram and an Euler diagram. 

Data characteristics 

The annotated modules targeted by the presented 
method have the following characteristics. 

D1 Small and sparse network topology, in which genes 
and interactions number in the dozens; 

D2 Many annotation sets, outnumbering gene 
interactions; 

D3 Annotation sets vary in cardinality, from a single 
node to the entire module; 

D4 Annotation sets overlap often. 

Integrative network analysis methods produce small and 
sparse subnetwork modules (D1), rather than large lists of 
differentially expressed genes. Embedding the module in 
a rich context of annotations on overlapping sets of genes 
is a typical next step to gain insight into the underlying 
biology (D2, D3, D4). 

Analysis tasks 

The focus (or perspective) of analysts alternates between 
genes (and interactions within a module) and annota- 
tion sets. Important analysis tasks are supported for each 
of these data aspects to enable an analyst to hypothe- 
size about the role of an extracted module in light of 
experimental conditions. 
For genes, analysts want to determine: 

G1 Level of differential expression: under- or 
over-expressed, or insignificant; 

G2 Interacting neighbors; 

G3 Annotations (set memberships); 

G4 Annotations shared with other genes. 

Single genes can become the focus of attention during 
the analysis process within the context of the module. 
The fact that a gene is part of a module does not imply 
that its under- or overexpression is significant. However, 
information about differential expression (G1) helps to 
explain a gene's presence in the module. For example, 
a gene that is not itself significantly differentially 
expressed may still be part of a module because it 
connects two differentially expressed submodules. An 
indirect involvement of the gene in a module mechanism 
is therefore likely. Its neighboring genes might also 
become interesting (G2), as might any mechanisms that 
the gene is already associated with (G3) and the 
mechanisms that it shares with other genes in the 
module (G4). 

For annotation sets, analysts want to determine: 

A1 Significance of overrepresentation; 
A2 Gene memberships. 

If a specific gene is interesting, its annotations might 
be too (G3 and G4). Annotation sets themselves can have 
such significance (A1) that they become interesting, which 
then translates to the genes contained in them (A2). Both 
significance in terms of an associated p-value and subjective 
significance are important for dividing attention between 
annotation sets. 

For interactions, analysts want to determine: 

L Annotation transitions between interacting genes. 



A change between annotations (L) may occur when the 
focus on a gene shifts to a neighboring gene (G2), which is 
of importance to an analyst to judge the role and relevance 
of the neighboring gene in the module. 

Related work 

Network visualization and tools. Many advanced tech- 
niques for the visualization of network topology have 
been developed [11-13], but few have been transferred 
to readily available tools. On the other hand, there are 
many tools for interpreting and exploring biological net- 
works [14], including the popular open source platforms 
Cytoscape [15] and PathVisio [16]. However, these cur- 
rently provide only limited capability to visualize anno- 
tated modules. PathVisio is a pathway analysis approach, 
in which sets are restricted to subsets of static, pre-defined 
individual pathways, and set membership is conveyed via 
node colors. Cytoscape's group attributes layout can be 
used to visualize partitions by showing disjoint parts in 
separate circles, but it does not support overlapping sets. 
The Venn and Euler diagram app [17] for Cytoscape does 
support overlapping sets, but it can handle only four at 
the same time (see Figures 3(a) and (b)). In this app, 
network and sets are visualized separately: set member- 
ship is conveyed by selecting a set and its corresponding 
nodes are highlighted in Cytoscape's network view. The 
RBVI collection of plugins [18] facilitates creation and 
editing of Cytoscape groups, and provides a group viewer 
that relies on aggregation of groups into meta-nodes. 
These meta-nodes can be visualized as standard nodes, 
as nodes containing embedded networks, or as charts. 
This approach, however, does not allow for visualization 
of overlapping sets. 

Figure 3 Comparison. Annotated module visualization using Cytoscape's Venn and Euler diagram app: (a) Venn diagram and (b) Euler diagram. The 
number of displayed sets is limited to four and no network structure is shown. (c) Module laid out by one of Cytoscape's built-in force-directed layout 
algorithms and BubbleSets superimposed on the network (same color scheme as in Figure 8(b)). Note that it is not immediately apparent that the 
nodes in the β-catenin set (blue) form a subset of Adherens junction (yellow), because the BubbleSet approach applies no explicit nesting of subsets. 

Set system visualization. In the information visualiza- 
tion field, Euler diagrams are used for the intuitive visu- 
alization of set systems [19-21], in which items belonging 
to the same set are denoted by contours. Variants of these 
approaches visualize sets over items with predefined posi- 
tions, e.g., over a given node-link visualization of a net- 
work. These methods range from connecting these items 
by simple lines (LineSets) [22], via colored shapes that are 
routed along the items (Kelp Diagrams) [23] and contours 
around the items (BubbleSets, see Figure 3(c)) [24,25] 
to hybrid approaches (KelpFusion) [26]. Visualizing an 
annotated module, however, requires an integrated lay- 
out of both its network and set system topologies, which 
is not possible with these approaches. Euler diagram 
methods focus on the layout of set relations at the expense 
of network topology. Likewise, laying out the network 
before superimposing set relations will emphasize net- 
work topology to the detriment of the set system. Some 
techniques exist that provide such integrated layouts 
[27-30], and which include aesthetic concerns and design 
of visual metaphors [31]. However, these approaches 
assume constraints on the network and set system 
topologies, e.g., strict partitions and no overlapping 
sets, and they are therefore not applicable to our 
problem. 

Method and implementation 

Visualizing an annotated module amounts to visualizing 
a hypergraph consisting of binary edges (interactions) 
between nodes (genes) and n-ary edges (annotation sets). 
Analysis tasks G2-G4 and A2 establish the equal importance 
of associating interactions and annotation sets, 
which is reflected in both the layout and the visualization 
of the hypergraph. Therefore, as opposed to combining 
multiple existing techniques (e.g., a force simulation to 
position the nodes according to the binary edges [32], a 
node overlap removal algorithm to keep nodes identifiable 
[33], and subsequent construction of a density field 
to derive contours for annotation sets [24]), our approach 
relies on a unified algorithm that treats binary and 
n-ary edges on equal terms. This allows us to compute a 
balanced layout, and also to choose suitable representations 
for the binary and n-ary edges. Mathematically, we 
achieve this by assigning a bit vector $\mathbf{t} = (t_1, t_2, \ldots, t_M)$ 
to every node $t \in V$ (the module genes) that encodes 
its membership in binary and n-ary edges $S_1, S_2, \ldots, S_M$. 
That is, $t_i = 1$ if $t \in S_i$ and $t_i = 0$ if $t \notin S_i$. 

To make this representation more concrete, consider 
the annotated module shown in Figure 2. The nodes are 
represented as the set $V = \{\text{Calm1}, \text{Calm2}, \text{Calm3}, \text{Kras}, \text{Nr3c2}, \text{Plcb4}\}$. 
There are seven sets representing the edges 
and three sets representing pathway memberships. The 
edge sets are $S_1 = \{v_1, v_4\}$, $S_2 = \{v_1, v_6\}$, $S_3 = \{v_2, v_4\}$, 
$S_4 = \{v_2, v_6\}$, $S_5 = \{v_3, v_4\}$, $S_6 = \{v_3, v_6\}$, and 
$S_7 = \{v_4, v_5\}$. Note that nodes $v_4$ (Kras) and $v_6$ (Plcb4) have 
some additional outgoing edges, but their targets are not 
visible in the image. Therefore, we ignore these edges in 
this example. The pathway memberships are the Glioma 
set $S_8 = \{v_1, v_2, v_3, v_4\}$, the Long-term potentiation set 
$S_9 = \{v_1, v_2, v_3, v_4, v_6\}$, and the GnRH signaling pathway 
set $S_{10} = \{v_1, v_2, v_3, v_4, v_6\}$. Now, for example, node $v_5$ gets 
assigned the bit vector $\mathbf{t}_{v_5} = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)$ and 
node $v_6$ the bit vector $\mathbf{t}_{v_6} = (0, 1, 0, 1, 0, 1, 0, 0, 1, 1)$. 

This high-dimensional representation is then used to lay 
out the nodes without overlap, the binary edges as curves, 
and the n-ary edges as contours. 
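The encoding of the Figure 2 example can be reproduced in a few lines of Python. This is a minimal sketch, not eXamine's implementation: the names `NODES`, `SETS`, `bit_vector`, and `BIT_VECTORS` are illustrative, and the set contents are those listed in the example.

```python
# Node order v1..v6 as given in the Figure 2 example.
NODES = ["Calm1", "Calm2", "Calm3", "Kras", "Nr3c2", "Plcb4"]

# S1..S7 are binary edges, S8..S10 are pathway annotation sets.
# Members are 1-based node indices (v1 -> 1, ..., v6 -> 6).
SETS = [
    {1, 4}, {1, 6}, {2, 4}, {2, 6}, {3, 4}, {3, 6}, {4, 5},  # S1..S7
    {1, 2, 3, 4},        # S8: Glioma
    {1, 2, 3, 4, 6},     # S9: Long-term potentiation
    {1, 2, 3, 4, 6},     # S10: GnRH signaling pathway
]


def bit_vector(node_index, sets=SETS):
    """Return (t_1, ..., t_M) with t_i = 1 iff node v_{node_index} is in S_i."""
    return tuple(1 if node_index in s else 0 for s in sets)


# One bit vector per gene, keyed by gene name.
BIT_VECTORS = {name: bit_vector(k) for k, name in enumerate(NODES, start=1)}
```

Looking up `BIT_VECTORS["Nr3c2"]` and `BIT_VECTORS["Plcb4"]` gives the vectors for $v_5$ and $v_6$ from the example.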

Extension to self-organizing maps 

Self-organizing maps (SOMs), introduced by Kohonen 
[34], are artificial neural networks that are used to map 
high-dimensional data items to a discretized low-dimensional 
space. SOMs are used in a visualization setting to cluster 
similar items together in a 2D embedding, which results in 
a landscape of items based on their features [35,36]. Typical 
SOMs consist of a square grid of size $N \times N$ with 
a neuron $n_{x,y} \in [0..1]^M$ at every grid cell. A neuron $n_{x,y}$ 
is a bit vector of size $M$ whose dimension matches the 
data items' dimensions. In our case, the data items $T$ 
correspond to the set of nodes $V$ in the annotated module. 
The training algorithm applies unsupervised reinforcement 
learning in an iterative fashion: at every iteration 
$i \in \{1, \ldots, I\}$ all data items $t \in T$ are considered and the 
neuron that matches $t$ most closely is determined using 
a distance function such as the Euclidean or Manhattan 
norm. This neuron and its neighboring neurons within 
radius $r_i$ are updated to match $t$ even more closely by 
setting their respective vectors $q$ to $q + \alpha_i (t - q)$ (see 
Figure 4(a)). In early iterations $i$, the trained neighborhoods 
are large with $r_i$ close to the grid size $N$ and the training 
strength $\alpha_i$ close to 1. The parameters $r_i$ and $\alpha_i$ decrease 
monotonically with increasing $i$. As such, items that differ 
strongly will distribute across the map to establish their 
own regions in the grid at early stages. Items with smaller 
differences are separated along the grid at a more local 
level as the training iterations progress. 

Figure 4 Training neuron $n_{x,y}$. (a) The neighborhood within range 
$r_i$ is trained (colored gray). (b) Certain tiles are already reserved 
(colored red) in the RSOM algorithm; item $t$ therefore trickles 
outwards to the best matching free spots (outlined). 

Reservation-based training. Similar items may end up 
at the same grid position in a standard SOM. This issue is 
usually solved by showing aggregate depictions of items, 
but we need to have separate depictions without overlap 
to support tasks G1-G4. Therefore, each item has to map 
to a unique grid position. We achieve this by altering the 
training algorithm: 

Algorithm RSOM(T) 
for i ← 1 to I 
    do Initialize copy U of T and clear neuron reservations. 
       while U contains items 
           do Draw and remove item t from U. 
              Find unreserved neuron n_{x,y} with smallest distance d(t, n_{x,y}). 
              Reserve n_{x,y} for t. 
              for any neuron q within range r_i from (x, y) 
                  do q ← q + α_i (t - q) 

The algorithm assigns every item to a unique neuron 
in every training iteration, because once a neuron is 
reserved by an item, subsequent items will ignore it. This 
causes a flooding effect where similar items end up in the 
same area of the grid and trickle outwards as the area 
becomes more crowded (see Figure 4(b)). 
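The reservation scheme can be sketched in Python as follows. This is an illustrative reading of the RSOM pseudocode, not the eXamine source. Several choices are assumptions made for the sketch: random initial neuron values, a circular neighborhood on the grid, linearly decaying training strength and range, a Euclidean default distance, and the names `rsom` and `euclidean`.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rsom(items, n, iterations, c=0.5, dist=euclidean, seed=0):
    """Map every item to its own cell of an n x n grid.

    items: dict of name -> vector (e.g. a membership bit vector).
    Returns dict of name -> (x, y) as reserved in the final iteration.
    """
    if not items:
        return {}
    if n * n < len(items):
        raise ValueError("grid too small to give every item its own cell")
    rng = random.Random(seed)
    dim = len(next(iter(items.values())))
    # Assumption: neurons start at random values in [0, 1].
    grid = {(x, y): [rng.random() for _ in range(dim)]
            for x in range(n) for y in range(n)}
    placement = {}
    for i in range(1, iterations + 1):
        decay = 1.0 - i / iterations       # assumed linear schedule
        alpha = c * decay                  # training strength alpha_i
        radius = math.ceil(decay * n)      # neighborhood range r_i
        reserved = set()                   # clear neuron reservations
        pending = list(items)              # copy U of T
        rng.shuffle(pending)               # draw items in random order
        placement = {}
        for name in pending:
            t = items[name]
            # Best-matching neuron among those not yet reserved.
            free = [cell for cell in grid if cell not in reserved]
            bx, by = min(free, key=lambda cell: dist(t, grid[cell]))
            reserved.add((bx, by))
            placement[name] = (bx, by)
            # Pull every neuron within range r_i towards t.
            for (x, y), q in grid.items():
                if math.hypot(x - bx, y - by) <= radius:
                    for k in range(dim):
                        q[k] += alpha * (t[k] - q[k])
    return placement
```

Because a reserved neuron is skipped by later items, two items with identical vectors still receive different grid cells, which is the property the reservation step is meant to provide.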

Configuration. The metric distance form of cosine similarity 
is used as the distance function $d$, i.e., 
$d(q, p) = \cos^{-1}\left( \frac{q^{T} \cdot p}{|q||p|} \right) \pi^{-1}$. 
This measure outperforms the 
Euclidean and Manhattan norms in high-dimensional 
spaces. The SOM is trained with a learning strength and 
neighborhood range that decrease linearly with increasing 
iteration $i$. A standard choice is $\alpha_i = c \cdot (1 - i/I)$ and 
$r_i = \lceil (1 - i/I) \cdot N \rceil$, where $c \in (0..1)$ is a small constant 
that determines the initial training strength. We use 
$N = 2|T|$ and $I = 10^6/|T|$ for the number of neurons and 
iterations, balancing node placement freedom versus required 
display space and providing gradual and accurate training, 
respectively. 
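The distance function and the two schedules translate directly into code. The following is a hedged sketch: the function names are illustrative, the rounding of $r_i$ is assumed to be a ceiling, and the handling of an all-zero vector is an assumption because cosine similarity is undefined in that case.

```python
import math


def angular_distance(q, p):
    """Metric form of cosine similarity: arccos(q.p / (|q||p|)) / pi."""
    dot = sum(a * b for a, b in zip(q, p))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in p))
    if norm == 0.0:
        # Assumption: treat a zero vector as orthogonal to everything.
        return 0.5
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp floating-point noise
    return math.acos(cosine) / math.pi


def schedule(i, total, n, c):
    """Return (alpha_i, r_i) for iteration i out of `total` on an n x n grid."""
    decay = 1.0 - i / total
    return c * decay, math.ceil(decay * n)
```

For non-negative vectors such as membership bit vectors the distance lies between 0 (same direction) and 0.5 (no shared sets).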

Layout preservation. A new layout has to be computed 
whenever the user selects or deselects a set. The new lay- 
out should change little in comparison to the old layout 
to preserve the user's mental map. This is achieved by a 
simple addition to the SOM algorithm, where a new SOM 
is initialized with the previous configuration of the neurons, 
i.e., an item that was positioned at $n_{x,y}$ in the old 
SOM is placed at $n_{x,y}$ in the new SOM and its neighborhood 
is trained according to the new bit vector of the item. 
The new SOM retains much of the initial configuration 
by starting the training factor $\alpha_i$ at $c = 0.01$. Naturally, 
this imposes a trade-off between layout quality and conservation. 
The layout will sometimes change strongly to 
accommodate the addition of a set that contains many 
items. In contrast, the layout can be retained if only a 
small set that does not alter much of the topology is added. 
This approach does not consider a history of topological 
changes, as is done in online graph drawing [37] to capture 
temporal dynamics, but is sufficient to maintain a stable 
and interactive environment. 

Set dominance. The user can make a certain set 
more dominant in the layout by having the training algorithm 
place the items of that set closer to each other than 
the items of other sets. This relies on weighting the components 
of the item bit vectors: every $S_i$ is given a weight 
$w_i$ with $w_i = 1$ initially. The bit vectors are augmented to 
incorporate these weights: $t_i = w_i$ if $t \in S_i$ and $t_i = 0$ if 
$t \notin S_i$. The bit vector component of $S_i$ will therefore play 
a more prominent role in distance metric $d$ when the user 
increases $w_i$ (see Figure 5). 
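The effect of a set weight on the distance between two members can be checked numerically. The sketch below is illustrative only: `weighted_vector` and the example vectors are hypothetical, and the distance is the angular form of cosine similarity, assuming non-zero vectors.

```python
import math


def weighted_vector(bits, weights):
    """Replace each membership bit t_i = 1 by the set weight w_i."""
    if len(bits) != len(weights):
        raise ValueError("one weight per set is required")
    return tuple(float(w) if b else 0.0 for b, w in zip(bits, weights))


def angular_distance(q, p):
    """arccos of the cosine similarity, divided by pi (non-zero vectors)."""
    dot = sum(a * b for a, b in zip(q, p))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in p))
    return math.acos(max(-1.0, min(1.0, dot / norm))) / math.pi


# Two hypothetical items that share only the first set.
a_bits, b_bits = (1, 1, 0), (1, 0, 1)

uniform = (1, 1, 1)    # initial weights w_i = 1
dominant = (3, 1, 1)   # the user raises the weight of the shared set

before = angular_distance(weighted_vector(a_bits, uniform),
                          weighted_vector(b_bits, uniform))
after = angular_distance(weighted_vector(a_bits, dominant),
                         weighted_vector(b_bits, dominant))
```

Here `after` is smaller than `before`: raising the weight of the shared set moves its two members closer in the space that the SOM is trained on.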

Assigning greater weight to a set improves the quality of 
its layout by coalescing its elements, which aids tasks G4 
and A2. However, it also degrades the layout quality of 
other sets and links when their topology conflicts with the 
prioritized set. This stems from the difficulty of project- 
ing elements from a high-dimensional space down to a 
two-dimensional space, which sometimes results in a sub- 
optimal layout per set. Interactive manipulation provides 
a way to assign different priorities to sets, and improve 
their layouts. 

Contours. The SOM's neuron grid is used to define the 
contours representing the active set system. Let $S_i$ be an 
active set. The corresponding $i$-th components of the neurons 
define a scalar field that forms a fuzzy membership 
landscape for $S_i$. This field is similar to the density field 
used in Bubble Sets [24]. Now, the inclusion of the grid 
tile of neuron $n$ in the contour body is determined by 
imposing a threshold (for example $1/2$) on the $i$-th component 
(see Figure 6(a)). The contour can then be tightened 
to reduce sharp corners by including parts of tiles that are 
free of items, as illustrated in Figure 6(b). 
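The thresholding step can be sketched as follows. This is a minimal illustration under stated assumptions: neurons are stored in a dictionary keyed by tile coordinates, the default threshold of 0.5 is only an example value, and the tightening across free tiles is not covered. The function names are hypothetical.

```python
def contour_tiles(neurons, i, threshold=0.5):
    """Tiles whose neuron has an i-th component above the threshold."""
    return {cell for cell, vector in neurons.items() if vector[i] > threshold}


def boundary_edges(tiles):
    """Unit edges separating included tiles from excluded ones.

    Tile (x, y) covers the square from (x, y) to (x + 1, y + 1).
    """
    edges = []
    for (x, y) in sorted(tiles):
        if (x - 1, y) not in tiles:   # left side is exposed
            edges.append(((x, y), (x, y + 1)))
        if (x + 1, y) not in tiles:   # right side is exposed
            edges.append(((x + 1, y), (x + 1, y + 1)))
        if (x, y - 1) not in tiles:   # bottom side is exposed
            edges.append(((x, y), (x + 1, y)))
        if (x, y + 1) not in tiles:   # top side is exposed
            edges.append(((x, y + 1), (x + 1, y + 1)))
    return edges
```

The edges returned by `boundary_edges` form the initial, blocky outline that later geometric steps would refine.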

Figure 5 Changing the dominance of a set. (a) Highly dominant 
set, drawing proteins of the set together. (b) Non-dominant set, 
where the network topology fully defines the layout. 

After establishing the layout of the contours, we apply 
geometric post-processing steps [23] to improve aesthetics, 
where all sets are legible (tasks G3 and A2) and 
contours form clear boundaries underneath interactions 
(task L). Sharp corners of the initial contours are rounded 
by a dilation of $r$, erosion of $2r$, and subsequent dilation 
of $r$ (see Figure 6). Here dilate and erode are equivalent 
to Minkowski sum and Minkowski subtraction operators 
with a circle of radius $r$ [38], respectively. In addition, the 
contours are nested by applying different levels of erosion, 
enforcing a certain distance between them. The thick 
colored ribbons in Figure 7 are obtained by taking the 
body $b$ of a contour, eroding it to get a smaller body $b_e$, 
and taking the symmetric difference $b - b_e$ of $b$ and $b_e$ to 
effectively cut $b_e$ out of $b$. Here, the extent of the erosions 
and dilations (radius $r$) is bounded by a fraction of the 
grid's tile size. This guarantees that items are contained by 
a contour of $S_i$ if, and only if, these items are contained 
by $S_i$. 
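The dilate, erode, dilate sequence and the ribbon construction can be illustrated on a raster. Note the substitution: the text describes Minkowski operations on contour geometry, whereas this sketch works on sets of pixel coordinates with a discrete disk, purely to show the order of operations. The pixels are assumed to be much finer than SOM tiles, and all function names are hypothetical.

```python
def disk(r):
    """Integer offsets inside a disk of radius r (the structuring element)."""
    return [(dx, dy) for dx in range(-r, r + 1) for dy in range(-r, r + 1)
            if dx * dx + dy * dy <= r * r]


def dilate(cells, r):
    """Discrete Minkowski sum of a pixel set with a disk of radius r."""
    offsets = disk(r)
    return {(x + dx, y + dy) for (x, y) in cells for (dx, dy) in offsets}


def erode(cells, r):
    """Discrete Minkowski subtraction: keep pixels whose disk fits inside."""
    offsets = disk(r)
    return {(x, y) for (x, y) in cells
            if all((x + dx, y + dy) in cells for (dx, dy) in offsets)}


def smooth(cells, r):
    """Dilation of r, erosion of 2r, then dilation of r."""
    return dilate(erode(dilate(cells, r), 2 * r), r)


def ribbon(body, width):
    """Cut the eroded body b_e out of the body b, leaving a band."""
    return body - erode(body, width)
```

For a solid 5 by 5 block, `ribbon(block, 1)` keeps the 16 border pixels and drops the 3 by 3 interior, which mirrors how a thick outline band is obtained from a contour body.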



Set contours are drawn in descending nesting order, 
which is defined by their different erosion levels; the 
largest contour is drawn first and the smallest contour 
last. The contour ribbons are assigned unique colors per 
set and are drawn fully opaque to prevent any confu- 
sion caused by blended colors. Occlusion is mitigated 
by limiting the width of the ribbons. Finally, the con- 
tours are drawn a second time as dashed lines such 
that occluded contour sections can be inferred — see 
Figure 7. 

Implementation 

We have implemented the technique in a Cytoscape app, 
and have emphasized simplicity of interaction and visual 
presentation in the design. The available sets are sorted 
by significance and listed in the set overview on the left, 
where the significance of a set is visualized as a circle, 
scaled logarithmically and accompanied by its scientific 
exponent as text (task A1). The user may select sets for 
inclusion in the annotated network visualization to the 
right (see Figure 8(c)). All described functionalities can 
be used at interactive speeds for networks up to dozens 
of nodes, edges, and active sets, including laying out the 
network with the RSOM training algorithm. Geometric 
operations on the contours, such as dilations and erosions, 
are performed via Java Topology Suite [39]. 

Interaction. Interactions consist of simple mouse actions 
(see the video in the Supplemental Material). The 
inclusion of a set in the network visualization is toggled 
via the set's label in the set overview or its contour in 
the network visualization (task A2). Additional informa- 
tion about a set or node may be obtained via a hyperlink 
to a web page provided in the input data, enabling quick 
access to external information sources such as the KEGG 
website. This approach keeps the tool flexible, i.e., the tool 
itself does not have to be altered every time a new kind of 
set or node from a different database is loaded. 

The links of a node are emphasized when it is hovered 
over (see Figure 9(a)) such that its direct neighborhood 
can be discerned from its surroundings (task G2). More- 
over, sets that contain the hovered node are highlighted 
as well. Likewise, links can be hovered to highlight their 
nodes and common sets. Vice versa, the contours of a 
set are emphasized and its comprising nodes are high- 
lighted when it is hovered over (see Figure 9(b)). This 
provides immediate feedback to the user about node-set 
relations (tasks G3 and A2) without having to select a 
set and consequently changing the layout of the network 
visualization. 

The lists of annotation sets can be expanded and collapsed 
by clicking on their headers, and scrolled downward 
to sets of lower significance by turning the mouse 
wheel. The set circles that convey significance remain 
visible at all times, grouping at the list top and bottom, 
to guarantee the depiction of all set memberships when a 
node is hovered. 

Figure 6 Derivation of contours for set $S_i$. The darkness of a tile represents the value of the neurons' $i$-th component, the thick black line is the 
contour, dots represent items that are in $S_i$, and white dots are items that are not in $S_i$. (a) Contour that results from the union of tiles with a value 
above a certain threshold. (b) Refined contour with shortcuts across free tiles. 

The user can adjust the dominance of a set by spinning 
the mouse wheel while hovering over either the set's label 
in the set overview or contour in the network visualiza- 
tion. This enables the user to give a set a central role in the 
layout (see Figure 5(a)) or to remove any of its influence 
(see Figure 5(b)). 

All changes to the visualization caused by interac- 
tion are animated. Colors and positions of items are 
altered gradually. Link layout changes are animated by 
interpolating their control points, while contour layouts 
are handled by fading out the old contour and fading 
in the new contour. The use of layout preservation, as 
described previously, in combination with animations 
helps to preserve the user's mental map. 




Figure 7 Geometric refinement of set contours after initial 
layout. Corners are smoothened by dilation and erosion operations, 
and contours are given a thick and colored internal ribbon. Unique 
erosion levels create distance between contour outlines, and contour 
overlap is emphasized by dashed lines. 



Color. Unique, distinguishable colors are derived from 
Color Brewer palettes [40], and assigned to annotation 
sets in a cyclic manner to avoid assigning the same color 
consecutively. In addition, large differences in contrast are 
avoided. For example, text and set outlines are colored 
dark gray instead of black to reduce their visual domi- 
nance. Black is only used when items are hovered over 
or highlighted such that they attract attention, as shown 
in Figure 9. Moreover, labels of selected sets (in the set 
overview) are emphasized with a more intense black color 
to ensure that they are readable in a colored surround- 
ing. Node labels have a white background to make sure 
that their text is legible when drawn on top of a set rib- 
bon with a dark color. Likewise, links have halos that make 
them easier to distinguish and their intersections more 
pronounced. 

Cytoscape integration. eXamine is tightly integrated 
into Cytoscape. Cytoscape's group functionality is used to 
represent sets and we rely on the table import functional- 
ity for importing both the set and node annotations. The 
user is also able to group sets into different categories. The 
Cytoscape node fill color map attribute is used to color 
the nodes in eXamine according to gene expression score 
(task G1). The user therefore has the freedom to define 
the desired color map via Cytoscape. The user can invoke 
eXamine on the currently selected nodes via the eXamine 
control panel. There the user can select which categories 
to show as well as the number of sets per category. In 
addition, the user can specify that the Cytoscape selec- 
tion should be updated to match the union or intersection 
of the selected sets in eXamine. This enables the use of 
eXamine with any kind of module extraction algorithm 
and/or filter method in Cytoscape, which includes manual 
node selection. 

Figure 8 Case study snapshots. Gene differential expression is shown as a colored box drawn around the node label (green for under-expression 
and red for over-expression). (a) The annotated module after tagging of the two familiar pathways Pathways in cancer and Phosphatidylinositol 
signaling system in C1. (b) The annotated module after tagging functions Beta-catenin binding and Growth factor activity in C3 and C4. (c) The fully 
annotated module, including annotation set overview, from which the hypothesis of C5 is derived. 



Results: a case study of US28 mediated signaling 

We demonstrate how a domain expert can use eXam- 
ine by working out a case study in which a data set is 
re-analyzed (this work was done by the co-authors with 
biological expertise). While this data set has been studied 
extensively, it was possible to derive a new hypothesis via 
eXamine. 

The Human Cytomegalovirus (HCMV) is a highly 
contagious herpes virus [41]. Infection with HCMV in 
healthy humans usually does not result in symptoms. 
However, in humans with a compromised immune sys- 
tem the virus is correlated with diseases such as hepatitis 
and retinitis [42]. In addition, HCMV gene products have 
been detected in various tumors even though HCMV is 
not considered to be an oncogenic virus. Experts therefore 
hypothesize that the virus may act as a stimulating factor 
during onset and development of cancer without being a 
root cause [43-45]. 

Figure 9 Item highlighting. (a) Hovered protein (Met) with emphasized interaction links to its neighbors on the right and emphasized sets (KEGG 
pathways) that contain this protein on the left. Sets outside of the list scope are grouped as markers at the top and bottom, where one set in the 
bottom group is emphasized. (b) Hovered set (Pathways in cancer) with emphasized member proteins, interactions, and contour. 

HCMV is responsible for the production of several 
viral G protein-coupled receptors (vGPCRs). Of these 
vGPCRs, US28 is the most studied and is characterized 
as a chemokine sink [46]. Chemokines are signaling proteins 
that induce cell migration. Moreover, US28 hijacks 
the host cell's signaling pathways and stimulates proliferative 
signaling pathways [47-51]. Previous studies focused 
on transcriptome analysis to evaluate pathways that are 
affected by US28. Differentially expressed genes involved 
in HCMV-induced disease symptoms were identified and 
related to known pathways [49,50]. However, this analysis 
did not include network-based module extraction and 
enrichment. 

To identify additional deregulated signaling due to 
US28, we analyzed the same data overlaid on the KEGG 
mouse network [8]. The network consisted of 3863 nodes 
and 29293 edges. Gene p-values, reflecting whether genes 
are significantly differentially expressed, were derived 
using RMA [52] and LIMMA [53]. Heinz [7], a tool 
for identifying differentially expressed modules, was then 
applied using a false discovery rate of 0.0007. This resulted 
in a module of 17 proteins. Finally, enrichment analysis 
using TopGO [54] was performed to annotate this mod- 
ule with enriched GO-terms and KEGG pathways (see 
Figure 8). 

These data processing steps correspond to the initial 
steps in Figure 1. The subsequent analysis of the annotated 
module aims at obtaining new insights about US28- 
mediated signaling. The analysis follows the visual ana- 
lytics cycle consisting of observation, knowledge, questions 
and exploration, finalized by a hypothesis. All analysis 
steps are shown in the screencast of Additional file 1. 

C1 Two familiar pathways 

Observation. The KEGG pathway annotation sets show 
significant presence of Pathways in cancer and Phosphatidylinositol 
signaling (p-values of $5.6 \cdot 10^{-6}$ and 
$1.0 \cdot 10^{-6}$, respectively). 

Knowledge. An oncomodulatory role has been proposed 
for US28 [43-45], which coincides with the pres- 
ence of Pathways in cancer and makes the genes 
annotated by this term of interest. Phosphatidylinositol 
signaling corresponds to previous work linking US28 to 
Phosphatidylinositol-mediated calcium responses [47,55]. 

Question. Which parts of the module are involved in
Pathways in cancer and Phosphatidylinositol signaling?

Interaction. Tag the Pathways in cancer and Phos- 
phatidylinositol signaling annotation sets (see Figure 8(a)). 

C2 Choosing sides 

Observation. A clear division of the module is apparent
after tagging the two familiar pathways. Genes Arf6,
Csnk2a1, Ipmk, Nr3c2 and Rock1 are not part of
the pathways but have direct, unambiguous interactions
with one of the two pathways.
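
This observation corresponds to a simple neighborhood query on the
module. The Python sketch below is a minimal illustration, assuming the
module is available as an edge list and the tagged pathways as gene sets.
The function name is hypothetical, and the example data contain only the
Pathways in cancer membership and the few interactions named explicitly
in this paper (Kras with Nr3c2; Csnk2a1, Tcf7l1 and Met with Ctnnb1), not
the full module.

```python
def boundary_genes(links, pathways):
    """Map each gene outside all pathways to the pathways it touches.

    links    -- iterable of (gene, gene) pairs, treated as undirected
    pathways -- dict from pathway name to the set of its member genes
    """
    member_of = {}
    for name, genes in pathways.items():
        for gene in genes:
            member_of.setdefault(gene, set()).add(name)

    touched = {}
    for u, v in links:
        # Check both directions because the links are undirected.
        for outside, inside in ((u, v), (v, u)):
            if outside not in member_of and inside in member_of:
                touched.setdefault(outside, set()).update(member_of[inside])
    return touched


# Partial data: only what is stated in the text, not the whole module.
tagged_pathways = {
    "Pathways in cancer": {
        "Kras", "Met", "Figf", "Hgf", "Fgf7", "Ctnnb1", "Tcf7l1",
    },
}
module_links = [
    ("Kras", "Nr3c2"),
    ("Csnk2a1", "Ctnnb1"),
    ("Tcf7l1", "Ctnnb1"),
    ("Met", "Ctnnb1"),
]

print(boundary_genes(module_links, tagged_pathways))
```

On this partial data the query returns Nr3c2 and Csnk2a1, the two genes
that the text describes as directly interacting with Pathways in cancer.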

Knowledge. Because of the known involvement of US28
in Phosphatidylinositol signaling, we do not focus on the
genes of this pathway (Calm1..3, Plcb4, Pip5k1a), nor
on the directly interacting genes (Arf6, Ipmk, Rock1).
Instead, the Pathways in cancer genes Kras, Met, Figf, Hgf,
Fgf7, Ctnnb1 and Tcf7l1, and directly interacting genes
Nr3c2 and Csnk2a1 may lead to new insights in
US28-mediated signaling and ultimately the oncomodulatory
role of HCMV.



Question. Do any of the aforementioned genes in or 
adjacent to Pathways in cancer lead to new insights in 
US28-mediated signaling? 

Interaction. Hover over the genes in and close to Path- 
ways in cancer to determine mechanisms of interest. 

C3 A twist of β-catenin

Observation. The genes in Pathways in cancer can be
divided roughly into two subsets: those that are annotated
by growth-factor activity and those annotated by
β-catenin binding (see Figure 8(b)). Csnk2a1, Tcf7l1 and
Met are part of the latter annotation set, where Tcf7l1
and Csnk2a1 are down- and up-regulated, respectively.
Expression of the neighboring Ctnnb1 (β-catenin) is
up-regulated.

Knowledge. β-catenin signaling results in elevated protein
levels of the TCF/LEF transcription factor family that
contains the protein encoded by Tcf7l1. Although Tcf7l1
is down-regulated, a recent study shows that this is not
reflected at the protein level and that US28 induces
β-catenin signaling [51]. In the same study, involvement of
WNT/Frizzled via the canonical signaling pathway was
ruled out and a hypothesis stating that US28-mediated
signaling of β-catenin proceeds via ROCK1, which is also
present in the module, was postulated.

Question. Are there alternative mechanisms explaining
the activation of β-catenin?

Interaction. Tag the Growth factor activity annotation 
set (see Figure 8(b)). 

C4 Growing knowledge 

Observation. Fgf, Hgf and Figf are annotated with
Growth factor activity and connected to β-catenin via Met.

Knowledge. MET is a receptor tyrosine kinase, whose 
only ligand is HGF. Therefore we can rule out the links 
from Met to Fgf and to Figf. In fact, these links are artifacts 
of how the mouse network was constructed from KEGG 
pathways. These artifacts often link whole groups of genes 
such as, in this case, growth factors to receptor tyrosine 
kinases. 

Question. Does the Hgf-Met axis relate to β-catenin
activation?

Interaction. Hover over Met and Ctnnb1 (β-catenin).

C5 New insights

Observation. Met and β-catenin are both part of the
Adherens junction pathway, as are Tcf7l1 and Csnk2a1 (see
Figure 8(c)).

Knowledge. Adherens junctions bind two cells together,
keeping multiple cells in place. Alternative mechanisms
have been described that explain β-catenin activation via
the release of β-catenin from cell-to-cell adherens
junctions (e.g. [56]). US28 promotes cell migration [57,58],
which causes the loss of cell-to-cell contacts with
subsequent release of β-catenin into the cytoplasm. This may
explain increased levels of β-catenin as found previously
[51].

By requesting additional information for Adherens junction
via eXamine, showing an external website by KEGG,
we find an indirect connection between Met and
β-catenin in the pathway (see Figure 10). Activation of MET
via HGF mediates the release of β-catenin from adherens
junctions, resulting in increased TCF/LEF levels [59,60].

Hypothesis. Combining this with the growth factor 
observations of C4 leads to the following hypothesis. 

• US28-mediated up-regulation of Hgf results in 
elevated levels of the corresponding HGF protein; 

• The subsequent activation of MET results in the
release of β-catenin into the cytoplasm;

• Subsequent translocation into the nucleus leads to 
enhanced TCF/LEF activation. 



Synopsis

We are currently validating the hypothesis experimentally.
Preliminary results indicate that the up-regulation of
Hgf is indeed reflected at the protein level. Should this
hypothesis turn out to be true, we would obtain crucial
insights into one of the mechanisms by which the
HCMV-encoded chemokine receptor US28 rewires cellular
signaling. Ultimately, we would like to understand how this
virus achieves its oncomodulatory role and how this can
be disrupted.

Discussion

The analysis tasks described in the background section
guided the design decisions that we have taken in the
implementation of eXamine. These decisions are motivated
via the analysis cycles of the US28 case study.

Overview. The benefit of a spacious annotation set
overview follows from the first cycle (C1), in which the
categorized, ranked, and legible annotation lists enable
the fast recognition of two familiar and significantly
represented pathways (task A1). Subsequent tagging of the
two pathways reveals their module genes (task A2) and
concisely drawn contours emphasize the division of the
module into two parts and some additional genes that are
not part of the pathways.

An annotation table, separate from the network, would not
have made this division as apparent. The main reason
is that annotation set transitions along gene interactions
are not explicit in such a representation. In contrast,
such cross-contour interaction links are clearly visible in
eXamine (e.g. the transition from Kras in Pathways in
cancer to Nr3c2 outside of Pathways in cancer).



Figure 10 Connection between Met and β-catenin. Proteins that are associated to the selected Adherens junction at the left and corresponding
KEGG pathway information at the right, where reactions catalyzed by module proteins are marked in red. Activation of MET by its ligand HGF results
in the phosphorylation of β-catenin. This in turn results in its release from cadherin-complexes on the cell membrane into the cytoplasm.

Annotated genes. The need to focus on specific genes
and their properties appears in the second analysis cycle
(C2), in which genes of Pathways in cancer are inspected
for annotations of interest (task G3). Highlighting
annotations by hovering over genes enables fast identification
of relevant annotations in the stable overview that
oriented the analyst in C1. Vice versa, hovering an
annotation of interest (β-catenin binding) confirms that it is
shared by Csnk2a1, Tcf7l1, and Met (task G4). The same
observations could have been made from an annotation
table. However, the topological characteristics of these
three genes would have been harder to discern, i.e., their
direct interaction with Ctnnb1 (task G2). This also applies
to other set visualizations without depiction of network
topology, such as Venn or Euler diagrams, as shown in
Figure 3(a) and (b). To make the topology of the gene
interactions more explicit, a node-link visualization could
be used. For example, Figure 3(c) shows the module laid
out by one of the built-in force-directed layout algorithms
of Cytoscape with all five annotation sets superimposed
as BubbleSets. However, the structure of the annotation
sets is hard to discern, and it is not immediately clear that
nodes belonging to the β-catenin binding set (blue shape)
form a proper subset of the Adherens junction set (yellow
shape).
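
The subset relation that is hard to read from the BubbleSets rendering
can be checked directly on the set system. The Python sketch below is a
minimal illustration with a hypothetical function name. It lists only
the memberships named in this paper; the annotation sets shown in the
tool may contain further genes.

```python
from itertools import permutations


def proper_subset_pairs(annotation_sets):
    """Return (inner, outer) name pairs where inner is a proper subset of outer."""
    return [
        (inner, outer)
        for inner, outer in permutations(annotation_sets, 2)
        if annotation_sets[inner] < annotation_sets[outer]
    ]


# Memberships as stated in the case study (possibly incomplete).
annotation_sets = {
    "Pathways in cancer": {
        "Kras", "Met", "Figf", "Hgf", "Fgf7", "Ctnnb1", "Tcf7l1",
    },
    "beta-catenin binding": {"Csnk2a1", "Tcf7l1", "Met"},
    "Adherens junction": {"Met", "Ctnnb1", "Tcf7l1", "Csnk2a1"},
}

print(proper_subset_pairs(annotation_sets))
```

With these memberships the only pair reported is β-catenin binding
inside Adherens junction, which matches the containment that a contour
layout should make visible.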

Integration. The third cycle (C3) shows the importance
of gene expression values (task G1), which is not limited
to the interpretation of genes in isolation but extends across
multiple genes, their interactions, and shared annotation sets.
The importance of integrated support for all analysis tasks 
follows from the remaining cycles (C4-C5), where multi- 
ple deductions are made in succession via multiple tasks. 
Here, tagging relevant pathways enables the analyst to 
build up a context for making deductions. 

Limitations. eXamine is designed to accurately convey
small and annotated modules, consisting of up to about
thirty proteins and categories of up to about twenty
annotations (note that these limits are not hard). The case
study shows that common analysis tasks for these modules
are covered. Scalability is a concern as our approach
focuses on small modules to enable accurate depiction of
set contours; it is not possible to construct a comprehensive
layout if the module consists of hundreds of proteins
or if there are dozens of annotation sets to visualize at the
same time. Both aspects make visual analysis ineffective.
However, this is a natural limitation of any visualization
approach based on node-link diagrams and set contours.

Our technique relies on a focus and context approach, 
in which the network and set system has been pruned 
down to the most relevant components first. Communi- 
cating small-scale information is given priority to support 
hypothesis generation at the level of individual proteins 
and their interactions, as follows from the targeted anal- 
ysis tasks. Nonetheless, the tool is capable of visualizing 
modules of up to a hundred proteins, albeit with less 
legibility of interactions and annotations. 

The integration of eXamine into Cytoscape mitigates
many scalability issues. Cytoscape, for example, provides
a global view of the network, in which the user can zoom
in on smaller subnetworks for more in-depth analysis
by eXamine. In addition, the integration into Cytoscape
provides access to further analysis algorithms.

The extended SOM algorithm embeds an annotated
module to reflect its topology, i.e., the distances between
its proteins based on common interactions and annotations.
This does not guarantee optimal aesthetics, however,
and unnecessary link and contour intersections can
sometimes occur. The analysis tasks targeted by eXamine
are not much hampered by such intersections since
all interactions, annotations, and their interplay remain
pronounced. However, to communicate analysis results,
aesthetics might need further improvement. This could
be done by weighing aesthetic criteria such as the number
of intersections and shape complexity against each
other, and formulating this as a combinatorial optimization
problem. The associated algorithms [11] are often
complex, and integrating them into an interactive system
is not straightforward.
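
One of the aesthetic criteria mentioned above, the number of link
intersections, is easy to measure for a given layout. The Python sketch
below is an illustration only and is not part of the eXamine layout
algorithm: it assumes straight-line links and 2D node positions, and all
names and coordinates are hypothetical.

```python
from itertools import combinations


def _orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, 0, or -1."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)


def segments_cross(a, b, c, d):
    """True if segments ab and cd cross at a single interior point."""
    return (
        _orientation(a, b, c) * _orientation(a, b, d) < 0
        and _orientation(c, d, a) * _orientation(c, d, b) < 0
    )


def count_link_crossings(positions, links):
    """Count pairs of links that cross, ignoring links that share a node."""
    crossings = 0
    for (u, v), (s, t) in combinations(links, 2):
        if len({u, v, s, t}) < 4:
            continue  # links meeting at a common node are not crossings
        if segments_cross(positions[u], positions[v],
                          positions[s], positions[t]):
            crossings += 1
    return crossings


# Hypothetical layout: four nodes on the corners of a unit square.
positions = {"a": (0, 0), "b": (1, 1), "c": (0, 1), "d": (1, 0)}
print(count_link_crossings(positions, [("a", "b"), ("c", "d")]))  # diagonals
print(count_link_crossings(positions, [("a", "c"), ("b", "d")]))  # sides
```

A count like this could serve as one term in a weighted cost, next to a
measure of contour shape complexity, when comparing candidate layouts.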

Application to other domains. eXamine is not limited to 
the analysis of enriched protein modules nor to data from 
the biological domain. It can be applied to any small net- 
work module that is accompanied by a set system, such as 
a social circle that consists of people, their relationships, 
and common interests. 

Conclusions 

We have proposed a visualization approach that enables 
the analysis of small and annotated network modules, and 
have implemented this in the Cytoscape app eXamine. 
Our approach displays sets as contours on top of a 
node-link layout. We have introduced an extension to 
the self-organizing maps algorithm to lay out module 
edges and annotation sets in a unified way. The added 
value of our approach has been demonstrated in a case 
study of a US28-mediated signaling module, in which a
novel hypothesis about the way US28 induces β-catenin
signaling has been derived.

Availability and requirements 

Project name: eXamine 

Project homepage: http://apps.cytoscape.org/apps/examine

Operating system(s): all 
Programming language: Java 
Other requirements: Cytoscape 3.x 
License: GPL2 

Any restrictions to use by non-academics: None

Additional file

Additional file 1: Screencast. Screencast of interactive analysis in
eXamine for the US28 case study.